In [1]:
Copied!
import gwaslab as gl
import pandas as pd
import gwaslab as gl
import pandas as pd
Download UKB-PPP Supplementary tables and load ST9¶
In [2]:
Copied!
sig_pqtl = pd.read_excel("41586_2023_6592_MOESM3_ESM.xlsx",sheet_name="ST9",skiprows=4)
sig_pqtl = pd.read_excel("41586_2023_6592_MOESM3_ESM.xlsx",sheet_name="ST9",skiprows=4)
Load into gwaslab Sumstats object¶
In [3]:
Copied!
leads_gls = gl.Sumstats(sig_pqtl,
snpid="Variant ID (CHROM:GENPOS (hg37):A0:A1:imp:v1)",
chrom="CHROM",
pos="GENPOS (hg38)",
mlog10p="log10(p) (discovery)",
other=["Assay Target"],
build="38")
leads_gls.data
leads_gls = gl.Sumstats(sig_pqtl,
snpid="Variant ID (CHROM:GENPOS (hg37):A0:A1:imp:v1)",
chrom="CHROM",
pos="GENPOS (hg38)",
mlog10p="log10(p) (discovery)",
other=["Assay Target"],
build="38")
leads_gls.data
2024/10/31 17:14:51 GWASLab v3.5.0 https://cloufield.github.io/gwaslab/ 2024/10/31 17:14:51 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com 2024/10/31 17:14:51 Start to initialize gl.Sumstats from pandas DataFrame ... 2024/10/31 17:14:51 -Reading columns : log10(p) (discovery),Variant ID (CHROM:GENPOS (hg37):A0:A1:imp:v1),GENPOS (hg38),CHROM,Assay Target 2024/10/31 17:14:51 -Renaming columns to : MLOG10P,SNPID,POS,CHR,Assay Target 2024/10/31 17:14:51 -Current Dataframe shape : 14287 x 5 2024/10/31 17:14:51 -Initiating a status column: STATUS ... 2024/10/31 17:14:51 -Genomic coordinates are based on GRCh38/hg38... 2024/10/31 17:14:52 Start to reorder the columns...v3.5.0 2024/10/31 17:14:52 -Current Dataframe shape : 14287 x 6 ; Memory usage: 22.07 MB 2024/10/31 17:14:52 -Reordering columns to : SNPID,CHR,POS,MLOG10P,STATUS,Assay Target 2024/10/31 17:14:52 Finished reordering the columns. 2024/10/31 17:14:52 -Column : SNPID CHR POS MLOG10P STATUS Assay Target 2024/10/31 17:14:52 -DType : object string int64 float64 category object 2024/10/31 17:14:52 -Verified: T F T T T NA 2024/10/31 17:14:52 #WARNING! Columns with possibly incompatible dtypes: CHR 2024/10/31 17:14:52 -Current Dataframe memory usage: 22.07 MB 2024/10/31 17:14:52 Finished loading data successfully!
Out[3]:
| SNPID | CHR | POS | MLOG10P | STATUS | Assay Target | |
|---|---|---|---|---|---|---|
| 0 | 2:27730940:T:C:imp:v1 | 2 | 27508073 | 79.2414 | 3899999 | A1BG |
| 1 | 7:73012042:G:A:imp:v1 | 7 | 73597712 | 31.5131 | 3899999 | A1BG |
| 2 | 12:104000470:T:C:imp:v1 | 12 | 103606692 | 11.8431 | 3899999 | A1BG |
| 3 | 19:58861808:A:G:imp:v1 | 19 | 58350442 | 695.1800 | 3899999 | A1BG |
| 4 | 3:56849749:T:C:imp:v1 | 3 | 56815721 | 14.5676 | 3899999 | AAMDC |
| ... | ... | ... | ... | ... | ... | ... |
| 14282 | 7:76017997:C:T:imp:v1 | 7 | 76388680 | 4415.1200 | 3899999 | ZP3 |
| 14283 | 10:28912850:C:G:imp:v1 | 10 | 28623921 | 90.1242 | 3899999 | ZP3 |
| 14284 | 16:30666367:C:T:imp:v1 | 16 | 30655046 | 11.0429 | 3899999 | ZP3 |
| 14285 | 17:7435040:G:A:imp:v1 | 17 | 7531723 | 25.1130 | 3899999 | ZP3 |
| 14286 | 19:1646712:C:T:imp:v1 | 19 | 1646713 | 26.0150 | 3899999 | ZP3 |
14287 rows × 6 columns
Load mapping information¶
In [4]:
Copied!
mapping = pd.read_excel("41586_2023_6592_MOESM3_ESM.xlsx",sheet_name="ST3",skiprows=2)
mapping
mapping = pd.read_excel("41586_2023_6592_MOESM3_ESM.xlsx",sheet_name="ST3",skiprows=2)
mapping
Out[4]:
| UKBPPP ProteinID | Olink ID | Assay Target | Protein panel | Gene symbol | UniProt | Gene CHROM | Gene start | Gene end | Dilution factor | % of samples below LOD | Coefficient of variation (median %) | % of data failing QC | ICC (interclass correlation) of sample age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A1BG:P04217:OID30771:v1 | OID30771 | A1BG | Inflammation_II | A1BG | P04217 | 19 | 58345178 | 58353492 | 1:100000 | 0.000807 | 3.910273 | 0.000000 | 0.003475 |
| 1 | AAMDC:Q9H7C9:OID30236:v1 | OID30236 | AAMDC | Cardiometabolic_II | AAMDC | Q9H7C9 | 11 | 77821109 | 77918432 | 1:1 | 0.008185 | 5.368324 | 0.000000 | 0.048915 |
| 2 | AARSD1:Q9BTE6:OID21311:v1 | OID21311 | AARSD1 | Oncology | AARSD1 | Q9BTE6 | 17 | 42950526 | 42964498 | 1:1 | 0.000446 | 4.720118 | 0.000000 | 0.092459 |
| 3 | ABCA2:Q9BZC7:OID30146:v1 | OID30146 | ABCA2 | Cardiometabolic_II | ABCA2 | Q9BZC7 | 9 | 137007234 | 137028915 | 1:1 | 0.031455 | 12.061490 | 0.000000 | 0.003752 |
| 4 | ABHD14B:Q96IU4:OID20921:v1 | OID20921 | ABHD14B | Neurology | ABHD14B | Q96IU4 | 3 | 51968510 | 51983409 | 1:1 | 0.000343 | 8.802544 | 0.003003 | 0.070720 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2936 | ZNRD2:O60232:OID31450:v1 | OID31450 | ZNRD2 | Oncology_II | ZNRD2 | O60232 | 11 | 65570460 | 65573942 | 1:1 | 0.074201 | 6.992866 | 0.000000 | 0.049257 |
| 2937 | ZNRF4:Q8WWF5:OID31347:v1 | OID31347 | ZNRF4 | Oncology_II | ZNRF4 | Q8WWF5 | 19 | 5455417 | 5456856 | 1:1 | 0.141641 | 7.861523 | 0.000000 | 0.018058 |
| 2938 | ZP3:P21754:OID30265:v1 | OID30265 | ZP3 | Cardiometabolic_II | ZP3 | P21754 | 7 | 76397518 | 76442071 | 1:1 | 0.007139 | 5.090834 | 0.000000 | 0.001365 |
| 2939 | ZP4:Q12836:OID31185:v1 | OID31185 | ZP4 | Oncology_II | ZP4 | Q12836 | 1 | 237877864 | 237890922 | 1:1 | 0.860881 | 6.884648 | 0.000000 | 0.001086 |
| 2940 | ZPR1:O75312:OID30801:v1 | OID30801 | ZPR1 | Neurology_II | ZPR1 | O75312 | 11 | 116773799 | 116788039 | 1:1 | 0.528915 | 9.293214 | 0.000000 | 0.002035 |
2941 rows × 14 columns
Load mapping information into gwaslab Sumstats object to fix CHR¶
Note: for simplicity we skip 15 proteins (AMY1A_AMY1B_AMY1C,BOLA2_BOLA2B,CGB3_CGB5_CGB8,CKMT1A_CKMT1B,DEFA1_DEFA1B,DEFB103A_DEFB103B,DEFB104A_DEFB104B,DEFB4A_DEFB4B,EBI3_IL27,FUT3_FUT5,IL12A_IL12B,LGALS7_LGALS7B,MICB_MICA,SPACA5_SPACA5B)
In [5]:
Copied!
mapping_gls = gl.Sumstats(mapping,
chrom="Gene CHROM",
other=["Gene start","Gene end","Assay Target"],
verbose=False)
mapping_gls.fix_chr(remove=True)
mapping_gls = gl.Sumstats(mapping,
chrom="Gene CHROM",
other=["Gene start","Gene end","Assay Target"],
verbose=False)
mapping_gls.fix_chr(remove=True)
2024/10/31 17:14:54 Start to fix chromosome notation (CHR)...v3.5.0
2024/10/31 17:14:54 -Current Dataframe shape : 2941 x 5 ; Memory usage: 21.57 MB
2024/10/31 17:14:54 -Checking CHR data type...
2024/10/31 17:14:54 -Variants with standardized chromosome notation: 2838
2024/10/31 17:14:54 -Variants with fixable chromosome notations: 88
2024/10/31 17:14:54 -Variants with invalid chromosome notations: 15
2024/10/31 17:14:54 -A look at invalid chromosome notations: {'19;19;19', '1;1;1', '15;15', 'X;X', '16;16'}
2024/10/31 17:14:54 -Identifying non-autosomal chromosomes : X, Y, and MT ...
2024/10/31 17:14:54 -Identified 88 variants on sex chromosomes...
2024/10/31 17:14:54 -Standardizing sex chromosome notations: X to 23...
2024/10/31 17:14:57 -Valid CHR list: 1 - 25
2024/10/31 17:14:57 -Removed 15 variants with chromosome notations not in CHR list.
2024/10/31 17:14:57 -A look at chromosome notations not in CHR list: {<NA>}
2024/10/31 17:14:57 Finished fixing chromosome notation (CHR).
Rename to START and END¶
In [6]:
Copied!
mapping_gls.data = mapping_gls.data.rename(columns={"Gene start":"START","Gene end":"END"})
mapping_gls.data = mapping_gls.data.rename(columns={"Gene start":"START","Gene end":"END"})
Check if cis or trans¶
In [7]:
Copied!
leads_cis_trans = leads_gls.check_cis(known=mapping_gls.data,
group_key="Assay Target",
gls=True)
leads_cis_trans.data
leads_cis_trans = leads_gls.check_cis(known=mapping_gls.data,
group_key="Assay Target",
gls=True)
leads_cis_trans.data
2024/10/31 17:14:57 Start to check if variants are in cis or trans regions...v3.5.0 2024/10/31 17:14:57 -Current Dataframe shape : 14287 x 6 ; Memory usage: 22.07 MB 2024/10/31 17:14:57 -Number of groups in sumstats:2414 2024/10/31 17:14:57 -Number of groups in reference:2908 2024/10/31 17:14:57 -Checking if variants in cis/trans regions grouped by Assay Target... 2024/10/31 17:14:57 -Window size in kb adding to start and end: 500... 2024/10/31 17:14:57 -Groups not in reference: AMY1A_AMY1B_AMY1C,BOLA2_BOLA2B,CGB3_CGB5_CGB8,CKMT1A_CKMT1B,DEFA1_DEFA1B,DEFB103A_DEFB103B,DEFB104A_DEFB104B,DEFB4A_DEFB4B,EBI3_IL27,FUT3_FUT5,IL12A_IL12B,LGALS7_LGALS7B,MICB_MICA,SPACA5_SPACA5B 2024/10/31 17:14:58 -Number of Cis variants: 1921 2024/10/31 17:14:58 -Number of Trans variants: 12252 2024/10/31 17:14:58 -Number of NoReference variants: 114 2024/10/31 17:14:58 Finished checking if variants are in cis or trans regions.
Out[7]:
| SNPID | CHR | POS | MLOG10P | STATUS | Assay Target | CIS/TRANS | REF_CHR | REF_START | REF_END | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2:27730940:T:C:imp:v1 | 2 | 27508073 | 79.2414 | 3899999 | A1BG | Trans | 19 | 58345178 | 58353492 |
| 1 | 7:73012042:G:A:imp:v1 | 7 | 73597712 | 31.5131 | 3899999 | A1BG | Trans | 19 | 58345178 | 58353492 |
| 2 | 12:104000470:T:C:imp:v1 | 12 | 103606692 | 11.8431 | 3899999 | A1BG | Trans | 19 | 58345178 | 58353492 |
| 3 | 19:58861808:A:G:imp:v1 | 19 | 58350442 | 695.1800 | 3899999 | A1BG | Cis | 19 | 58345178 | 58353492 |
| 4 | 3:56849749:T:C:imp:v1 | 3 | 56815721 | 14.5676 | 3899999 | AAMDC | Trans | 11 | 77821109 | 77918432 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14282 | 7:76017997:C:T:imp:v1 | 7 | 76388680 | 4415.1200 | 3899999 | ZP3 | Cis | 7 | 76397518 | 76442071 |
| 14283 | 10:28912850:C:G:imp:v1 | 10 | 28623921 | 90.1242 | 3899999 | ZP3 | Trans | 7 | 76397518 | 76442071 |
| 14284 | 16:30666367:C:T:imp:v1 | 16 | 30655046 | 11.0429 | 3899999 | ZP3 | Trans | 7 | 76397518 | 76442071 |
| 14285 | 17:7435040:G:A:imp:v1 | 17 | 7531723 | 25.1130 | 3899999 | ZP3 | Trans | 7 | 76397518 | 76442071 |
| 14286 | 19:1646712:C:T:imp:v1 | 19 | 1646713 | 26.0150 | 3899999 | ZP3 | Trans | 7 | 76397518 | 76442071 |
14287 rows × 10 columns
Plot¶
In [8]:
Copied!
leads_cis_trans.plot_gwheatmap(
sizes=(20,100),
ychrpad=0.3,
scaled=True,
xchrpad=0.2,
hspace=0.05,
cut=20,
build="38",
cutfactor=30,
sig_level_lead=1e-20,
bwindowsizekb=100,
windowsizekb=500000,
anno="GENENAME",
fig_kwargs={"figsize":(15,10),"dpi":400},
add_b=True )
leads_cis_trans.plot_gwheatmap(
sizes=(20,100),
ychrpad=0.3,
scaled=True,
xchrpad=0.2,
hspace=0.05,
cut=20,
build="38",
cutfactor=30,
sig_level_lead=1e-20,
bwindowsizekb=100,
windowsizekb=500000,
anno="GENENAME",
fig_kwargs={"figsize":(15,10),"dpi":400},
add_b=True )
2024/10/31 17:14:58 Start to create genome-wide scatter plot...v3.5.0: 2024/10/31 17:14:58 -Dropping 114 variants due to missing values in ['POS', 'CHR', 'REF_START', 'REF_CHR']. 2024/10/31 17:14:58 -P values are already converted to -log10(P)! 2024/10/31 17:14:58 Start to create MQQ plot...v3.5.0: 2024/10/31 17:14:58 -Genomic coordinates version: 38... 2024/10/31 17:14:58 -Genome-wide significance level to plot is set to 5e-08 ... 2024/10/31 17:14:58 -Raw input contains 14173 variants... 2024/10/31 17:14:58 -MQQ plot layout mode is : b 2024/10/31 17:14:58 Finished loading specified columns from the sumstats. 2024/10/31 17:14:58 Start data conversion and sanity check: 2024/10/31 17:14:58 -Sanity check will be skipped. 2024/10/31 17:14:58 -Calculating DENSITY with windowsize of 100 kb 2024/10/31 17:14:59 -Converting data above cut line... 2024/10/31 17:14:59 -Maximum DENSITY value is 442.0 . 2024/10/31 17:14:59 -DENSITY values above 20 will be shrunk with a shrinkage factor of 30... 2024/10/31 17:14:59 Finished data conversion and sanity check. 2024/10/31 17:14:59 Start to create MQQ plot with 14173 variants... 2024/10/31 17:14:59 -Creating background plot... 2024/10/31 17:14:59 Finished creating MQQ plot successfully! 2024/10/31 17:14:59 Start to extract variants for annotation... 2024/10/31 17:14:59 Start to annotate variants with nearest gene name(s)... 2024/10/31 17:14:59 -Assigning Gene name using ensembl_hg38_gtf for protein coding genes 2024/10/31 17:14:59 Finished annotating variants with nearest gene name(s) successfully! 2024/10/31 17:14:59 Finished extracting variants for annotation... 2024/10/31 17:14:59 Start to process figure arts. 2024/10/31 17:14:59 -Processing X ticks... 2024/10/31 17:14:59 -Processing X labels... 2024/10/31 17:14:59 -Processing Y labels... 2024/10/31 17:14:59 -Processing Y tick lables... 2024/10/31 17:14:59 -Processing Y labels... 2024/10/31 17:14:59 -Processing lines... 2024/10/31 17:14:59 -Plotting horizontal line ( mean DENISTY): y = 20.08280789693532 2024/10/31 17:14:59 -Plotting horizontal line ( median DENISTY): y = 3.0 2024/10/31 17:14:59 Finished processing figure arts. 2024/10/31 17:14:59 Start to annotate variants... 2024/10/31 17:14:59 -Annotating using column GENENAME... 2024/10/31 17:14:59 -Adjusting text positions with repel_force=0.03... 2024/10/31 17:14:59 Finished annotating variants. 2024/10/31 17:14:59 Start to save figure... 2024/10/31 17:14:59 -Skip saving figure! 2024/10/31 17:14:59 Finished saving figure... 2024/10/31 17:15:00 Finished creating plot successfully! 2024/10/31 17:15:00 -Processing X ticks... 2024/10/31 17:15:00 -Processing Y ticks... 2024/10/31 17:15:00 Start to save figure... 2024/10/31 17:15:01 -Saved to ./gwaslab_genome_wide_heatmap_20241031_1.png successfully! 2024/10/31 17:15:01 Finished saving figure...
Out[8]:
(<Figure size 6000x4000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7f1e758a6a60>)
In [ ]:
Copied!